252 research outputs found

    Banded structure in binary matrices

    Get PDF
    International audienceA binary matrix has a banded structure if both rows and columns can be permuted so that the non-zero entries exhibit a staircase pattern of overlapping rows. The concept of banded matrices has its origins in numerical analysis, where entries can be viewed as descriptions between the problem variables; the bandedness corresponds to variables that are coupled over short distances. Banded data occurs also in other applications, for example in the physical mapping problem of the human genome, in paleontological data, in network data and in the discovery of overlapping communities without cycles. We study the banded structure of binary matrices, give a formal definition of the concept and discuss its theoretical properties. We consider the algorithmic problems of computing how far a matrix is from being banded, and of finding a good submatrix of the original data that exhibits approximate bandedness. Finally, we show by experiments on real data from ecology and other applications the usefulness of the concept. Our results reveal that bands exist in real datasets and that the final obtained orderings of rows and columns have natural interpretations

    Linkage Disequilibrium between Chromosomes in the Human Genome : Test Statistics and Rapid Computation

    Get PDF
    Linkage disequilibrium (LD) refers to the statistical dependency of the DNA content at nearby locations of the chromosome. Numerous approaches to analyze genome data rely on the well documented fact that LD decays monotonously with the distance of the studied loci. This decay, though noisy and modi ed by a number of factors, can be attributed to the recombination process, a major source of genetic variation in diploid organisms. In this work we take first steps toward analyzing the extent of LD between very distant loci, even loci from different chromosomes. This is in contrast to traditional "genome-wide" analyses which merely study the LD within each chromosome separately. We design several measures of LD, and use them for analyzing the HapMap data. We also consider LD between supermarkers determined by haplotype clusters in windows of a few SNPs. We report on suggestive pairs of loci where unusually large correlations are observed within all ethnic groups. We describe how the computations can be arranged in a way that enables an all-pairs analysis of the data, that is, all pairs of loci across all the 22 autosomal chromosomes. This kind of "genome times genome" analysis is computationally very burdensome due to the sheer number of possible pairs. We show ways to make it feasible

    Seriation in paleontological data using Markov Chain Monte Carlo methods

    Get PDF
    Given a collection of fossil sites with data about the taxa that occur in each site, the task in biochronology is to find good estimates for the ages or ordering of sites. We describe a full probabilistic model for fossil data. The parameters of the model are natural: the ordering of the sites, the origination and extinction times for each taxon, and the probabilities of different types of errors. We show that the posterior distributions of these parameters can be estimated reliably by using Markov chain Monte Carlo techniques. The posterior distributions of the model parameters can be used to answer many different questions about the data, including seriation (finding the best ordering of the sites) and outlier detection. We demonstrate the usefulness of the model and estimation method on synthetic data and on real data on large late Cenozoic mammals. As an example, for the sites with large number of occurrences of common genera, our methods give orderings, whose correlation with geochronologic ages is 0.95.Given a collection of fossil sites with data about the taxa that occur in each site, the task in biochronology is to find good estimates for the ages or ordering of sites. We describe a full probabilistic model for fossil data. The parameters of the model are natural: the ordering of the sites, the origination and extinction times for each taxon, and the probabilities of different types of errors. We show that the posterior distributions of these parameters can be estimated reliably by using Markov chain Monte Carlo techniques. The posterior distributions of the model parameters can be used to answer many different questions about the data, including seriation (finding the best ordering of the sites) and outlier detection. We demonstrate the usefulness of the model and estimation method on synthetic data and on real data on large late Cenozoic mammals. As an example, for the sites with large number of occurrences of common genera, our methods give orderings, whose correlation with geochronologic ages is 0.95.Given a collection of fossil sites with data about the taxa that occur in each site, the task in biochronology is to find good estimates for the ages or ordering of sites. We describe a full probabilistic model for fossil data. The parameters of the model are natural: the ordering of the sites, the origination and extinction times for each taxon, and the probabilities of different types of errors. We show that the posterior distributions of these parameters can be estimated reliably by using Markov chain Monte Carlo techniques. The posterior distributions of the model parameters can be used to answer many different questions about the data, including seriation (finding the best ordering of the sites) and outlier detection. We demonstrate the usefulness of the model and estimation method on synthetic data and on real data on large late Cenozoic mammals. As an example, for the sites with large number of occurrences of common genera, our methods give orderings, whose correlation with geochronologic ages is 0.95.Peer reviewe

    Significance testing of word frequencies in corpora

    Get PDF
    Finding out whether a word occurs significantly more often in one text or corpus than in another is an important question in analysing corpora. As noted by Kilgarriff (Language is never, ever, ever, random, Corpus Linguistics and Linguistic Theory, 2005; 1(2): 263–76.), the use of the X2 and log-likelihood ratio tests is problematic in this context, as they are based on the assumption that all samples are statistically independent of each other. However, words within a text are not independent. As pointed out in Kilgarriff (Comparing corpora, International Journal of Corpus Linguistics, 2001; 6(1): 1–37) and Paquot and Bestgen (Distinctive words in academic writing: a comparison of three statistical tests for keyword extraction. In Jucker, A., Schreier, D., and Hundt, M. (eds), Corpora: Pragmatics and Discourse. Amsterdam: Rodopi, 2009, pp. 247–69), it is possible to represent the data differently and employ other tests, such that we assume independence at the level of texts rather than individual words. This allows us to account for the distribution of words within a corpus. In this article we compare the significance estimates of various statistical tests in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus. We find that the choice of the test, and hence data representation, matters. We conclude that significance testing can be used to find consequential differences between corpora, but that assuming independence between all words may lead to overestimating the significance of the observed differences, especially for poorly dispersed words. We recommend the use of the t-test, Wilcoxon rank sum test, or bootstrap test for comparing word frequencies across corpora.Peer reviewe
    corecore